Homework 3¶

Linear Algebra - Gaussian Elimination, SVD, Polynomial Regression, PCA, KNN, and Data Modeling¶


This notebook is arranged in cells. Texts are usually written in the markdown cells, and here you can use html tags (make it bold, italic, colored, etc). You can double click on this cell to see the formatting.

The ellipsis (...) are provided where you are expected to write your solution but feel free to change the template (not over much) in case this style is not to your taste.

Hit "Shift-Enter" on a code cell to evaluate it. Double click a Markdown cell to edit.


Imports¶

In [1]:
import numpy as np
from scipy.integrate import quad
import sklearn as sk
from sklearn import datasets, linear_model
from sklearn.preprocessing import PolynomialFeatures
#For plotting
import matplotlib.pyplot as plt
%matplotlib inline

Mounting Google Drive locally¶

Mount your Google Drive on your runtime using an authorization code.

Note: When using the 'Mount Drive' button in the file browser, no authentication codes are necessary for notebooks that have only been edited by the current user.

In [2]:
#plot goodies
def canvas_ticks(obj):
    '''This provides ticks in to a blanck canvas, for singular plots
    use plt as the argumenet, for suplots, in for gridspec for expample
    insert ax1 as argument'''
    obj.minorticks_on()
    obj.tick_params(labelsize=14)
    obj.tick_params(labelbottom=True, labeltop=False, labelright=False, labelleft=True)
    obj.tick_params(direction='in',which='minor', length=5, bottom=True, top=True, left=True, right=True)
    obj.tick_params(direction='in',which='major', length=10, bottom=True, top=True, left=True, right=True)

plt.style.use('default')
plt.rcParams['mathtext.fontset'] = 'stix'
plt.rcParams['font.family'] = 'STIXGeneral'
plt.rcParams['font.size'] = 14
error_bar_settings = {
        'fmt': 'o',
        'ms':6,
        # 'mfc': plot_color,
        'ecolor': 'black',
        'mec': 'black',
        'capsize': 2.,
        'mew': .5,
        'elinewidth': .7,
        # 'alpha': 0.85,
    }
In [3]:
from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive

Problem 1 - Solving Least Squares Using Normal Equations and SVD¶

(Reference - NR 15.4) We fit a set of 50 data points $(x_i, y_i)$ to a polynomial $y(x) = a_0 + a_1x + a_2x^2 + a_3x^3$. (Note that this problem is linear in $a_i$ but nonlinear in $x_i$). The uncertainty $\sigma_i$ associated with each measurement $y_i$ is known, and we assume that the $x_i$'s are known exactly. To measure how well the model agrees with the data, we use the chi-square merit function:
$$ \chi^2 = \sum_{i=0}^{N-1} \big( \frac{y_i-\sum_{k=0}^{M-1}a_k x^k}{\sigma_i} \big)^2. $$
where N = 50 and M = 4. Here, $1, x, ... , x^3$ are the basis functions.

1. Plot data. (Hint - https://matplotlib.org/api/_as_gen/matplotlib.axes.Axes.errorbar.html)

Again, in this and all future assignments, you are expected to show error bars in all figures if the data include uncertainties. You will lose points if error bars are not shown.

In [4]:
# Load a given 2D data
data = np.loadtxt("/content/drive/My Drive/P188_288/P188_288_HW3/Problem1_data.dat")
x = data[:,0]
y = data[:,1]
sig_y = data[:,2]
In [5]:
# Make plot
plt.figure(figsize = (10,7), dpi = 200)
# Scatter plot
plt.errorbar(x,y,yerr = sig_y, **error_bar_settings,)
canvas_ticks(plt)
plt.xlabel("x")
plt.ylabel("y")
Out[5]:
Text(0, 0.5, 'y')

We will pick as best parameters those that minimize $\chi^2$.

First, let $\bf A$ be a matrix whose $N \times M$ components are constructed from the $M$ basis functions evaluated at the $N$ abscissas $x_i$, and from the $N$ measurement errors $\sigma_i$, by the prescription $$ A_{ij} = \frac{X_j(x_i)}{\sigma_i} $$
where $X_0(x) = 1,\ X_1(x) = x,\ X_2(x) = x^2,\ X_3(x) = x^3$. We call this matrix $\bf A$ the design matrix.

Also, define a vector $\bf b$ of length $N$ by $$ b_i = \frac{y_i}{\sigma_i} $$
and denote the $M$ vector whose components are the parameters to be fitted ($a_0, a_1, a_2, a_3$) by $\bf a$.

2. Define the design matrix A. (Hint: Its dimension should be NxM = 50x4.) Also, define the vector b. Print the first row of A.

In [6]:
N = 50
M = 4
# Define A
A = np.zeros(shape=(N,M))
for i in range(N):
  for j in range(M):
    A[i,j] = x[i]**j / sig_y[i]
# Define b
b = y/sig_y

# Print the first row of A
A[0,:]
Out[6]:
array([0.30368985, 0.60162612, 1.1918541 , 2.36112786])

Minimize $\chi^2$ by differentiating it with respect to all $M$ parameters $a_k$ vaishes. This condition yields the matrix equation
$$ \sum_{j=0}^{M-1} \alpha_{kj}a_j = \beta_k$$
where $\bf \boldsymbol \alpha = A^T \cdot A$ and $\bf \boldsymbol \beta = A^T \cdot b$ ($\boldsymbol \alpha$ is an $M \times M$ matrix, and $\boldsymbol \beta$ is a vector of length $M$). This is the normal equation of the least squares problem. In matrix form, the normal equations can be written as: $$ \bf \boldsymbol \alpha \cdot a = \boldsymbol \beta. $$

This can be solved for the vector of parameters $\bf a$ by linear algebra numerical methods.

3. Define the matrix alpha and vector beta. Print both alpha and beta.

In [6]:
 
In [7]:
# Transpose of the matrix A
A_transpose = A.T
# alpha matrix
alpha = np.dot(A_transpose,A, )
# stuff changes with alpha
Alpha = alpha.copy()
# beta vector
beta = np.dot(A_transpose, b )

# Print alpha and beta
print("Alpha:\n",alpha)
print("Beta:\n",beta)
Alpha:
 [[7.57344292e+00 1.59581405e+01 1.20838371e+02 4.80208969e+02]
 [1.59581405e+01 1.20838371e+02 4.80208969e+02 3.20704253e+03]
 [1.20838371e+02 4.80208969e+02 3.20704253e+03 1.62887892e+04]
 [4.80208969e+02 3.20704253e+03 1.62887892e+04 1.05149671e+05]]
Beta:
 [  118.53904396   727.88040211  3581.30337095 22023.93157276]

4. We have $ \bf \boldsymbol \alpha \cdot a = \boldsymbol \beta. $ Solve for $\bf a$ using (1) "GaussianElimination_pivot" defined below (2) LU decomposition and forward subsitution and backsubstitution. Plot the best-fit line on top of the data.

Hint: You can use scipy.linalg.lu to do the LU decomposition. After you do "L, U = lu(A, permute_l=True)," print L and U matrices. Note that L is not a lower triangle matrix. Swap rows of L (and B) and make it a lower triangular matrix. And then, solve for y in Ly = B.

e.g. If your L matrix is the following:

[[ 0.01577114 0.10593754 0.41569921 1. ] [ 0.03323166 -0.04364428 1. 0. ] [ 0.25163705 1. 0. 0. ] [ 1. 0. 0. 0. ]],

you can change it to this:

[[ 1. 0. 0. 0. ] [ 0.25163705 1. 0. 0. ] [ 0.03323166 -0.04364428 1. 0. ] [ 0.01577114 0.10593754 0.41569921 1. ]]

Then, you should also change B from

[ 118.53904396 727.88040211 3581.30337095 22023.93157276]

to

[22023.93157276 3581.30337095 727.88040211 118.53904396].

In [8]:
def GaussianElimination_pivot(A, b):

    N = len(b)

    for m in range(N):

        # Check if A[m,m] is the largest value from elements bellow and perform swapping
        for i in range(m+1,N):
            if A[m,m] < A[i,m]:
                A[[m,i],:] = A[[i,m],:]
                b[[m,i]] = b[[i,m]]

        # Divide by the diagonal element
        div = A[m,m]
        A[m,:] /= div
        b[m] /= div

        # Now subtract from the lower rows
        for i in range(m+1,N):
            mult = A[i,m]
            A[i,:] -= mult*A[m,:]
            b[i] -= mult*b[m]

    # Backsubstitution
    x = np.empty(N,float)
    for m in range(N-1,-1,-1):
        x[m] = b[m]
        for i in range(m+1,N):
            x[m] -= A[m,i]*x[i]

    return x
In [9]:
# Using the Gaussian elimination with partial pivoting
a = GaussianElimination_pivot(alpha, beta)

print('Using Gaussian Elimination:')
print('a0 =', a[0], ', a1 =', a[1], ', a2 =', a[2], ', a3 =', a[3])
Using Gaussian Elimination:
a0 = -0.030816295372758873 , a1 = 2.667646082249882 , a2 = 0.31483927007853074 , a3 = 0.07945935335134478
In [10]:
# "lu" does LU decomposition with pivot. Reference - https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.linalg.lu_factor.html
from scipy.linalg import lu

def solve_lu_pivot(A, B):
    # LU decomposition with pivot
    L, U = lu(A, permute_l=True)

    N = len(B)

    # forward substitution: We have Ly = B. Solve for y
    L_inv = np.linalg.inv(L)
    y = np.dot(L_inv, B)

    # backward substitution: We have y = Ux. Solve for x.
    U_inv = np.linalg.inv(U)
    x = np.dot(U_inv, y)

    return x

aL = solve_lu_pivot(alpha, beta)

print('Using LU Decomposition:')
print('a0 =', aL[0], ', a1 =', aL[1], ', a2 =', aL[2], ', a3 =', aL[3])
Using LU Decomposition:
a0 = -0.030816295372751767 , a1 = 2.667646082249882 , a2 = 0.31483927007853074 , a3 = 0.07945935335134478
In [11]:
x_model = np.linspace(min(x), max(x), 1000)
# Make plot
plt.figure(figsize = (10,7), dpi = 200)
# Scatter plot
plt.errorbar(x,y,yerr = sig_y, **error_bar_settings,)
poly_a = np.poly1d(a[::-1])
poly_aL = np.poly1d(aL[::-1])
plt.plot(x_model, poly_a(x_model), 'k',label = 'Gaussian Elimination')
plt.plot(x_model, poly_aL(x_model), 'r--', label = 'LU decomposition')

# plt.plot()
canvas_ticks(plt)
plt.xlabel("x")
plt.ylabel("y")
plt.legend(frameon = False)
plt.show()

The inverse matrix $\bf C = \boldsymbol \alpha^{-1}$ is called the covariance matrix, which is closely related to the probable uncertainties of the estimated parameters $\bf a$. To estimate these uncertainties, we compute the variance associated with the estimate $a_j$. Following NR p.790, we obtain:

$$ \sigma^2(a_j) = \sum_{k=0}^{M-1} \sum_{l=0}^{M-1} C_{jk} C_{jl} \alpha_{kl} = C_{jj} $$
5. Compute the error (standard deviation - square root of the variance) on the fitted parameters using the covariance matrix.

In [12]:
from scipy.linalg import inv
# Covariance matrix

cov = inv(Alpha)

sigma_a0 = np.sqrt(cov[0,0])
sigma_a1 =  np.sqrt(cov[1,1])
sigma_a2 =  np.sqrt(cov[2,2])
sigma_a3 =  np.sqrt(cov[3,3])
print('Error: on a0 =', sigma_a0, ', on a1 =', sigma_a1, ', on a2 =', sigma_a2, ', on a3 =', sigma_a3)
Error: on a0 = 0.7139418927087803 , on a1 = 0.22390274023290208 , on a2 = 0.06344814294696766 , on a3 = 0.01199869685671923

Now, instead of using the normal equations, we use singular value decomposition (SVD) to find the solution of least squares. Please read Ch. 15 of NR for more details. Remember that we have the $N \times M$ design matrix $\bf A$ and the vector $\bf b$ of length $N$. We wish to mind $\bf a$ which minimizes $\chi^2 = |\bf A \cdot a - b|^2$.

Using SVD, we can decompose $\bf A$ as the product of an $N \times M$ column-orthogonal matrix $\bf U$, an $M \times M$ diagonal matrix $\bf S$ (with positive or zero elements - the "singular" values), and the transpose of an $M \times M$ orthogonal matrix $\bf V$. ($\bf A = USV^{T}$).
Let $\bf U_{(i)}$ and $\bf V_{(i)}$ denote the columns of $\bf U$ and $\bf V$ respectively (Note: We get $M$ number of vectors of length $M$.) $\bf S_{(i,i)}$ are the $i$th diagonal elements (singular values) of $\bf S$. Then, the solution of the above least squares problem can be written as:
$$ \bf a = \sum_{i=1}^M \big( \frac{U_{(i)} \cdot b}{S_{(i,i)}} \big) V_{(i)}. $$

The variance in the estimate of a parameter $a_j$ is given by: $$ \sigma^2(a_j) = \sum_{i=1}^M \big( \frac{V_{ji}}{S_{ii}} \big)^2 $$
and the covariance: $$ \mathrm{Cov}(a_j, a_k) = \sum_{i=1}^M \big( \frac{V_{ji}V_{ki}}{S_{ii}^2} \big). $$

6. Decompose the design matrix A using SVD. Estimate the parameter $a_i$'s and its variance.

In [13]:
# Reference - https://docs.scipy.org/doc/numpy-1.13.0/refberence/generated/numpy.linalg.svd.html
from scipy.linalg import svd

# Decompose A
# Note: S, in this case, is a vector of length M, which contains the singular values.
U, S, VT = svd(A, full_matrices=False)
V = VT.T
print(U[:,0].shape)
# Solve for a
a_ = []
for i in range(M):
  a_i = np.dot(U[:,i], b)/S[i] * V[:,i] #[row,column]
  a_.append(a_i)
a_from_SVD = np.sum(a_, axis = 0)
print('Using SVD:')
print('a0 =', a_from_SVD[0], ', a1 =', a_from_SVD[1], ', a2 =', a_from_SVD[2], ', a3 =', a_from_SVD[3])
(50,)
Using SVD:
a0 = -0.030816295372707803 , a1 = 2.6676460822498824 , a2 = 0.3148392700785285 , a3 = 0.07945935335134474
In [14]:
# Error on a
sigma = np.zeros(M)
for i in range(M):
  sigma_i = (V[:,i]/S[i])**2
  sigma += sigma_i
sigma_a_SVD = np.sqrt(sigma)

print('Error: on a0 =', sigma_a_SVD[0], ', on a1 =', sigma_a_SVD[1], ', on a2 =', sigma_a_SVD[2], ', on a3 =', sigma_a_SVD[3])
Error: on a0 = 0.7139418927087802 , on a1 = 0.22390274023290271 , on a2 = 0.06344814294696767 , on a3 = 0.011998696856719273

Suppose that you are only interested in the parameters $a_0$ and $a_1$. We can plot the 2-dimensional confidence region ellipse for these parameters by building the covariance matrix: $$ \mathrm{C'} = \binom{\sigma({a_0})^2\ \ \ \ \ \ \mathrm{Cov}({a_0, a_1})}{\mathrm{Cov}({a_0, a_1}) \ \ \ \ \ \ \sigma({a_1})^2} $$

The lengths of the ellipse axes are the square root of the eigenvalues of the covariance matrix, and we can calculate the counter-clockwise rotation of the ellipse with the rotation angle: $$ \theta = \frac{1}{2} \mathrm{arctan}\Big( \frac{2\cdot \mathrm{Cov}({a_0, a_1})}{\sigma({a_0})^2-\sigma({a_1})^2} \Big) = \mathrm{arctan}(\frac{\vec{v_1}(y)}{\vec{v_1}(x)}) $$
where $\vec{v_1}$ is the eigenvector with the largest eigenvalue. So we calculate the angle of the largest eigenvector towards the x-axis to obtain the orientation of the ellipse.


Then, we multiply the axis lengths by some factor depending on the confidence level we are interested in. For 68%, this scale factor is $\sqrt{\Delta \chi^2} \approx 1.52$. For 95%, it is $\approx 2.48$.

7. Compute the covariance between $a_0$ and $a_1$. Plot the 68% and 95% confidence region of the parameter $a_0$ and $a_1$. The skeleton code is already provided.

In [15]:
from matplotlib.patches import Ellipse
import matplotlib as mpl
from numpy.linalg import eigvals
In [16]:
cov = np.zeros(2)

for i in range(4):
  cov_i = V[0,i] * V[1,i]/ S[i]**2
  cov += cov_i
cov
Out[16]:
array([-0.05496243, -0.05496243])
In [17]:
# Build the covariance matrix
CovM = np.array([[sigma_a_SVD[0]**2, cov[1]],
                  [cov[0], sigma_a_SVD[1]**2]])
CovM
Out[17]:
array([[ 0.50971303, -0.05496243],
       [-0.05496243,  0.05013244]])
In [18]:
# Plot the confidence region (https://stackoverflow.com/questions/32371996/python-matplotlib-how-to-plot-ellipse)

eigvec, eigval, u = np.linalg.svd(CovM)

# Semimajor axis (diameter)
semimaj = np.sqrt(eigval[0])*2.
# Semiminor axis (diameter)
semimin = np.sqrt(eigval[1])*2.

theta = np.arctan(eigvec[0][1]/eigvec[0][0])

# Plot 1-sig confidence region
ell = mpl.patches.Ellipse(xy=[a[0], a[1]], width=1.52*semimaj, height=1.52*semimin, angle = theta*180/np.pi, facecolor = 'dodgerblue', edgecolor = 'royalblue', label = '68% confidence')
# Plot 2-sig confidence region
ell2 = mpl.patches.Ellipse(xy=[a[0], a[1]], width=2.48*semimaj, height=2.48*semimin, angle = theta*180/np.pi, facecolor = 'skyblue', edgecolor = 'royalblue', label = '95% confidence')

fig, ax = plt.subplots(figsize=(7,7))

ax.add_patch(ell2)
ax.add_patch(ell)


# Set bounds for x,y axes
bounds = np.sqrt(CovM.diagonal())
plt.xlim(a[0]-4*bounds[0], a[0]+4*bounds[0])
plt.ylim(a[1]-4*bounds[1], a[1]+4*bounds[1])

plt.grid(True)
plt.xlabel('$a_0$')
plt.ylabel('$a_1$')
plt.legend()
plt.show()

In lecture, we discussed that we fit the existing data to obtain model parameters in data analysis, while in machine learning we use the model derived from the existing data to make prediction for new data.

Next, let us take the given data and do the polynomial regression.

First, split the sample into training data and the testing data. Keep 80% data as training data and uses the remaining 20% data for testing.

8. Often, the data can be ordered in a specific manner, hence shuffle the data prior to splitting it into training and testing samples. (Use https://docs.scipy.org/doc/numpy/reference/generated/numpy.random.shuffle.html)

In [19]:
rand = np.arange(N)
np.random.shuffle(rand)
x, y, sig_y = x[rand], y[rand], sig_y[rand]

train_x, test_x = np.split(x, [int(len(x) * 0.8)])
train_y, test_y = np.split(y, [int(len(y) * 0.8)])
train_sigy, test_sigy = np.split(sig_y, [int(len(sig_y) * 0.8)])
In [19]:
 

In the case of polynomial regression, we need to generate polynomial features (http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html) for preprocessing. Note that we call each term in the polynomial as a "feature" in our model, and here we generate features' high-order (and interaction) terms. For example, suppose we set the degree of the polynomial to be 3. Then, the features of $X$ is transformed from $(X)$ to $(1, X, X^2, X^3)$. We can do this transform using PolynomialFeatures.fit_transform(train_x). But fit_transform() takes the numpy array of shape [n_samples, n_features]. So you need to re-define our training set as train_set_prep = train_x[:,np.newaxis] so that it has the shape [40,1].

9. Define three different polynomial models with degree of 1, 3, 10. (e.g. model = PolynomialFeatures(degree=...) ) Then, fit to data and transform it using "fit_transform"

In [20]:
# e.g.

model_deg1 = PolynomialFeatures(degree = 1)
X_model_deg_1 = model_deg1.fit_transform(train_x[:,np.newaxis])

model_deg3 = PolynomialFeatures(degree = 3)
X_model_deg_3 = model_deg3.fit_transform(train_x[:,np.newaxis])

model_deg10 = PolynomialFeatures(degree = 10)
X_model_deg_10 = model_deg10.fit_transform(train_x[:,np.newaxis])

Then, do the least squares linear regression. (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression.fit)

  1. define the object for linear regression: LR = linear_model.LinearRegression()
  2. Fit the linear model to the training data: LR.fit(transformed x data, y data)
  3. Define new x samples for plotting: X_sample = np.linspace(-5, 7, 100)
  4. Transform x sample: X_sample_transform = model.fit_transform(X_sample[:,np.newaxis])
  5. Predict using the linear model: Y_sample = LR.predict(X_sample_transform)
  6. Plot the fit: plt.plot(X_sample, Y_sample)

10. Do the linear regression for three different polynomial models defined in Part 9. Plot the fit on top of the training data (Label each curve).

In [21]:
#deg 1
LR_deg1 = linear_model.LinearRegression()
LR_deg1.fit(X_model_deg_1, train_y)
X_deg1_sample = np.linspace(-5, 7, 100)
X_deg1_sample_transform = model_deg1.fit_transform(X_deg1_sample[:,np.newaxis])
LR_deg1.predict(X_deg1_sample_transform)
Y_deg1_sample = LR_deg1.predict(X_deg1_sample_transform)
#deg 3
LR_deg3 = linear_model.LinearRegression()
LR_deg3.fit(X_model_deg_3, train_y)
X_deg3_sample = np.linspace(-5, 7, 100)
X_deg3_sample_transform = model_deg3.fit_transform(X_deg3_sample[:,np.newaxis])
LR_deg3.predict(X_deg3_sample_transform)
Y_deg3_sample = LR_deg3.predict(X_deg3_sample_transform)

#deg 10
LR_deg10 = linear_model.LinearRegression()
LR_deg10.fit(X_model_deg_10, train_y)
X_deg10_sample = np.linspace(-5, 7, 100)
X_deg10_sample_transform = model_deg10.fit_transform(X_deg10_sample[:,np.newaxis])
LR_deg10.predict(X_deg10_sample_transform)
Y_deg10_sample = LR_deg10.predict(X_deg10_sample_transform)
In [22]:
# Make plot
plt.figure(figsize = (10, 7))
plt.plot(X_deg1_sample, Y_deg1_sample, label = "Deg 1")
plt.plot(X_deg3_sample, Y_deg3_sample, label = "Deg 2")
plt.plot(X_deg10_sample, Y_deg10_sample, label = "Deg 3")

plt.xlabel('$x$')
plt.ylabel('$y$')
plt.xlim(-5, 7)
plt.ylim(-12, 65)
plt.legend(frameon = False)
canvas_ticks(plt)
plt.show()

11. Plot the fit on top of the test data (Label each curve).

In [23]:
# Make plot
plt.figure(figsize = (10, 7))
plt.errorbar(train_x, train_y , train_sigy, **error_bar_settings)
plt.plot(X_deg1_sample, Y_deg1_sample, label = "Deg 1")
plt.plot(X_deg3_sample, Y_deg3_sample, label = "Deg 3")
plt.plot(X_deg10_sample, Y_deg10_sample, label = "Deg 10")

plt.xlabel('$x$')
plt.ylabel('$y$')
plt.xlim(-5, 7)
plt.ylim(-12, 65)
plt.legend(frameon = False)
canvas_ticks(plt)
plt.show()

You can obtain the estimated linear coefficients using linearmodel.LinearRegression.coef (http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html#sklearn.linear_model.LinearRegression)

12. Print the linear coefficients of three polynomial models you used. For the polynomial of degree 10, do you see that high-order coefficients are very small?

In [24]:
print(f"The coef for deg 1:\n {LR_deg1.coef_}")
print(f"The coef for deg 3:\n {LR_deg3.coef_}")
print(f"The coef for deg 10:\n {LR_deg10.coef_}")
The coef for deg 1:
 [0.         5.15206924]
The coef for deg 3:
 [0.         2.71126276 0.34082605 0.07220783]
The coef for deg 10:
 [ 0.00000000e+00  3.03479834e+00 -2.45664835e+00  7.87278718e-02
  6.05911657e-01 -5.79006451e-02 -4.15401822e-02  7.00303668e-03
  6.70493974e-04 -2.08349279e-04  1.17336787e-05]

I see the higher order terms tend to be zero!¶


Problem 2 - Applying the PCA Method on Quasar Spectra¶

The following analysis is based on https://arxiv.org/pdf/1208.4122.pdf.

"Principal Component Analysis (PCA) is a powerful and widely used technique to analyze data by forming a custom set of “principal component” eigenvectors that are optimized to describe the most data variance with the fewest number of components. With the full set of eigenvectors the data may be reproduced exactly, i.e., PCA is a transformation which can lend insight by identifying which variations in a complex dataset are most significant and how they are correlated. Alternately, since the eigenvectors are optimized and sorted by their ability to describe variance in the data, PCA may be used to simplify a complex dataset into a few eigenvectors plus coefficients, under the approximation that higher-order eigenvectors are predominantly describing fine tuned noise or otherwise less important features of the data." (S. Bailey, arxiv: 1208.4122)

In this problem, we take the quasar (QSO) spectra from the Sloan Digital Sky Survey (SDSS) and apply PCA to them. Filtering for high $S/N$ in order to apply the standard PCA, we select 18 high-$S/N$ spectra of QSOs with redshift 2.0 < z < 2.1, trimmed to $1340 < \lambda < 1620\ \mathring{A}$.

In [25]:
# Load data
wavelength = np.loadtxt("/content/drive/My Drive/P188_288/P188_288_HW3/Problem2_wavelength.txt")
flux = np.loadtxt("/content/drive/My Drive/P188_288/P188_288_HW3/Problem2_QSOspectra.txt")
In [26]:
# Data dimension
print( np.shape(wavelength) )
print( np.shape(flux) )
(824,)
(18, 824)

In the above cell, we load the following data: wavelength in Angstroms ("wavelength") and 2D array of spectra x fluxes ("flux").

We have 824 wavelength bins, so "flux" is 18 $\times$ 824 matrix, each row containing fluxes of different QSO spectra.

1. Plot any three QSO spectra flux as a function of wavelength. (In order to better see the features of QSO spectra, you may plot them with some offsets.)

In [27]:
# Make plot
plt.figure(figsize = (8, 4), dpi = 200)
# plt.errorbar(train_x, train_y , train_sigy, **error_bar_settings)

plt.xlabel('Wavelength $[ \AA] $')
plt.ylabel('$f_{\lambda}+$ constant')
for i in range(3):
  plt.plot(wavelength, flux[i]+i)
plt.legend(frameon = False)
canvas_ticks(plt)
plt.show()
WARNING:matplotlib.legend:No artists with labels found to put in legend.  Note that artists whose label start with an underscore are ignored when legend() is called with no argument.
In [27]:
 

"Flux" is the data matrix of order 18 $\times$ 824. Call this matrix $\bf X$.

We can construct the covariance matrix $\bf C$ using the mean-centered data matrix. First, calculate the mean of each column and subtracts this from the column. Let $\bf X_c$ denote the mean-centered data matrix.

$\bf X_c$ $ = \begin{bmatrix} x_{(1,1)} - \overline{x}_1 & x_{(1,2)} - \overline{x}_2 & \dots & x_{(1,824)} - \overline{x}_{824} \\ x_{(2,1)} - \overline{x}_1 & x_{(2,2)} - \overline{x}_2 & \dots & x_{(2,824)} - \overline{x}_{824} \\ \vdots & \vdots & \vdots & \vdots \\ x_{(18,1)} - \overline{x}_1 & x_{(18,2)} - \overline{x}_2 & \dots & x_{(18,824)} - \overline{x}_{824} \end{bmatrix}$

where $x_{m,n}$ denote the flux of $m$th QSO in $n$th wavelength bin, and $\overline{x}_k$ is the mean flux in $k$th wavelength bin.

Then, the covariance matrix is: $\bf C$ $ = \frac{1}{N-1}$ $\bf X_c^T X_c.$ ($N$ is the number of QSOs.)

2. Find the covariance matrix C using the data matrix flux.

In [28]:
X_C  = flux - np.mean(flux, axis=0)
C = 1/(len(flux)-1)*np.matmul(X_C.T, X_C)
C.shape
Out[28]:
(824, 824)

3. Using numpy.linalg, find eigenvalues and eigenvectors of the covariance matrix. Order the eigenvalues from largest to smallest and then plot them as a function of the number of eigenvalues. (Remember that the eigenvector with the highest eigenvalue is the principle component of the data set.) In this case, we find that our covariance matrix is rank-17 matrix, so we only select the first 17 highest eigenvalues and corresponding eigenvectors (other eigenvalues are close to zero).

NOTE: Here, by ranking the eigenvalues based on their magnitudes, you basically rank them in order of significance. You should show that the first few components are dominant, accounting for most of the variability in the data. So you can plot eigenvalues as a function of component number (1,2,3,...,17)

In [29]:
from numpy.linalg import eig
eigenvalues, eigenvectors = np.linalg.eig(C)
sorted_indices = np.argsort(np.abs(eigenvalues))[::-1]
eigenvalues, eigenvectors = eigenvalues[sorted_indices], eigenvectors[sorted_indices]
In [30]:
# Make plot
N_eig = np.arange(len(eigenvalues))
plt.figure(figsize = (3.5,3.5), dpi = 200)
plt.scatter(N_eig[:17], eigenvalues[:17])
plt.ylabel('Eigen Value')
plt.xlabel('Component N')
canvas_ticks(plt)
plt.show()
/usr/local/lib/python3.10/dist-packages/matplotlib/collections.py:192: ComplexWarning: Casting complex values to real discards the imaginary part
  offsets = np.asanyarray(offsets, float)

4. Plot the first three eigenvectors. These eigenvectors represent the principal variations of the spectra with respect to that mean spectrum.

In [31]:
eigenvectors.shape
Out[31]:
(824, 824)
In [32]:
plt.figure(figsize = (8,3.5), dpi = 200)
for i in range(3):
  plt.plot(N_eig, eigenvectors[:,i], lw =1, label = f'$\mathbf{{v_{i}}}$')
# plt.ylabel('Eigen Vectors')
plt.xlabel('Component N')
canvas_ticks(plt)
plt.legend(frameon = False)
plt.show()
/usr/local/lib/python3.10/dist-packages/matplotlib/cbook/__init__.py:1335: ComplexWarning: Casting complex values to real discards the imaginary part
  return np.asarray(x, float)

The eigenvectors indicate the direction of the principal components, so we can re-orient the data onto the new zes by multiplying the original mean-centered data by the eigenvectors. We call the re-oriented data "PC scores." (Call the PC score matrix $\bf Z$) Suppose that we have $k$ eigenvectors. Construct the matrix of eigenvectors $\bf V = [v_1 v_2 ... v_k]$, with $\bf v_i$ the $i$th highest eigenvector. Then, we can get 18 $\times\ k$ PC score matrix by multiplying the 18 $\times$ 824 data matrix with the 824 $\times\ k$ eigenvector matrix:

$$ \bf Z = X_c V $$

Then, we can reconstruct the data by mapping it back to 824 dimensions with $\bf V^T$:

$$ \bf \hat{X} = \boldsymbol \mu + Z V^T $$ where $\boldsymbol \mu$ is the vector of mean QSO flux.

Now, comparing the original data with the reconstructed data, we can calculate the residuals. Let $\bf X_{(i)}, \hat{X}_{(i)}$ denote the rows of $\bf X, \hat{X}$ respectively. Remember that the data matrix has the dimension 18 $\times$ 824, so each row $\bf X_{(i)}$ corresponding the spectra of one particular QSO. (For example, if you wish to see the QSO spectra in row 7, you can plot $\bf X_{(7)}$ as a function of wavelength.). Then, we can simply calculate the residual as $\frac{1}{N} \sum_{i=1}^N \bf |\hat{X}_{(i)} - X_{(i)}|^2$ where $N$ is the total number of QSOs (NOTE: $\bf |\hat{X}_{(i)} - X_{(i)}|$ is the magnitude of the difference between two vectors $\bf \hat{X}_{(i)}$ and $\bf X_{(i)}$.)

5. First, start with only mean flux value $\boldsymbol \mu$ (in this case $\bf \hat{X} = \boldsymbol \mu, V = 0$) and calculate the residual. Then, do the reconstruction using the first two principal eigenvectors $\bf V = [v_1 v_2]$ and calculate the residual. Finally, let $\bf V = [v_1 v_2 ... v_6]$ (the first six principal eigenvectors) and compute the residual.

In [33]:
V = eigenvectors[:,:6]
Z = np.dot(X_C, V)
# X_C.shape
Z.shape, V.T.shape

Xhat = np.mean(flux, axis =0) +np.dot(Z, V.T)
residual = flux - Xhat

def QSO_residuals(num_eigen_vect):
  V = eigenvectors[:,:num_eigen_vect]
  Z = np.dot(X_C, V)
  Xhat = np.mean(flux, axis =0)
  if num_eigen_vect == 0:
    Xhat = np.mean(flux, axis =0)
  else:
    Xhat = np.mean(flux, axis =0) + np.dot(Z, V.T)
  residual = np.sum((flux - Xhat)**2)/ len(flux)
  return np.abs(residual), Xhat

residuals_0eigen, model_0eigen = QSO_residuals(0)
residuals_2eigen, model_2eigen = QSO_residuals(2)
residuals_6eigen, model_6eigen = QSO_residuals(6)

for i in [0,2,6]:
  print(f"The residual for {i} principle eigen values is {QSO_residuals(i)[0]} ")
The residual for 0 principle eigen values is 7.8382003427995715 
The residual for 2 principle eigen values is 4.370432340894141 
The residual for 6 principle eigen values is 3.4558940203861805 
In [33]:
 
In [34]:
model_0eigen.shape
Out[34]:
(824,)

6. For any two QSO spectra, plot the original and reconstructed spectra using the first six principal eigenvectors.

In [35]:
plt.figure(figsize = (10,5), dpi = 200)
# plt.scatter(wavelength,residual[5])
plt.plot(wavelength, flux[0], label = 'Original')
plt.plot(wavelength, model_6eigen[0],label = 'Reconstuction')
plt.ylabel('$f_\lambda$')
plt.xlabel('Wavelength $\AA$')
canvas_ticks(plt)
plt.legend(frameon = False)
plt.show()
In [36]:
plt.figure(figsize = (10,5), dpi = 200)
# plt.scatter(wavelength,residual[5])
plt.plot(wavelength, flux[1], label = 'Original')
plt.plot(wavelength, model_6eigen[1],label = 'Reconstuction')
plt.ylabel('$f_\lambda$')
plt.xlabel('Wavelength $\AA$')
canvas_ticks(plt)
plt.legend(frameon = False)
plt.show()

7. Plot the residual as a function of the number of included eigenvectors (1,2,3,...,17).

In [37]:
np.abs(QSO_residuals(0)[0])
Out[37]:
7.8382003427995715
In [38]:
plt.figure(figsize = (3.5,3.5), dpi = 200)
for i in range(1,18):
  plt.scatter(i, QSO_residuals(i)[0], ls = '-', color = 'k')
plt.ylabel('$f_\lambda$')
plt.xlabel('Eigen Value Components')
canvas_ticks(plt)
# plt.legend(frameon = False)
plt.show()

In this problem, we only have 18 QSO spectra, so the idea of using PCA may seem silly. We can also use SVD to find eigenvalues and eigenvectors. With SVD, we get $\bf X_c = USV^T$. Then, the covariance matrix is $\bf C$ $ = \frac{1}{N-1}$ $\bf X_c^T X_c$ $ = \frac{1}{N-1}$ $\bf VS^2V^T.$ Then, the eigenvalues are the squared singular values scaled by the factor $\frac{1}{N-1}$ and the eigenvectors are the columns of $\bf V$.

8. Find the eigenvalues applying SVD to the mean-centered data matrix $\bf X_c$.

In [39]:
from scipy.linalg import svd

U, S, VT = svd(X_C )
V = VT.T
# Print Eigenvalues
eigenvalues = S**2/(len(flux[:,0])-1)
eigenvalues
Out[39]:
array([3.57183360e+00, 1.16921314e+00, 8.19491399e-01, 6.43886056e-01,
       4.87138879e-01, 3.16043520e-01, 2.72246202e-01, 1.83227778e-01,
       1.41340818e-01, 1.35417557e-01, 1.24349547e-01, 9.68536857e-02,
       8.91735508e-02, 8.05492370e-02, 6.53953675e-02, 5.46083371e-02,
       4.85022745e-02, 2.72391755e-30])

Problem 3 - Back to MNIST¶

In Assignment 2, we used the UMAP module to reduce the MNIST dataset to 2 dimensions (from 784) for easy visualization and observed that different classes (10 digits - "0", "1", ..., "9") got separated nicely into clusters when the MNIST data are embedded into lower dimensions by UMAP.

In this exercise, instead of the UMAP module, we use the PCA method for dimensionality reduction.

As mentioned in Problem 2, PCA is a technique for reducing the number of dimensions in a dataset whilst retaining most information. It is using the correlation between some dimensions and tries to provide a minimum number of variables that keeps the maximum amount of variation or information about how the original data is distributed. It does not do this using guesswork but using hard mathematics and it uses something known as the eigenvalues and eigenvectors of the data-matrix. These eigenvectors of the covariance matrix have the property that they point along the major directions of variation in the data. These are the directions of maximum variation in a dataset. Here, we use the scikit-learn implementation of PCA: https://scikit-learn.org/stable/modules/generated/sklearn.decomposition.PCA.html

First, load the MNIST data:

In [86]:
from sklearn.datasets import fetch_openml
mnist = fetch_openml('mnist_784')
X = mnist.data
Y = mnist.target

X = X.to_numpy()
Y = Y.to_numpy()
/usr/local/lib/python3.10/dist-packages/sklearn/datasets/_openml.py:968: FutureWarning: The default value of `parser` will change from `'liac-arff'` to `'auto'` in 1.4. You can set `parser='auto'` to silence this warning. Therefore, an `ImportError` will be raised from 1.4 if the dataset is dense and pandas is not installed. Note that the pandas parser may return different data types. See the Notes Section in fetch_openml's API doc for details.
  warn(

"$X$" contains information about the given MNIST digits. We have a 28x28 pixel grid, so each image is a vector of length 784; we have 70,000 images (digits), so $X$ is a 70,000x784 matrix. "$Y$" is a label (0-9; the category to which each image belongs) vector of length 70,000.

1. Do the following:

(1) Randomly shuffle data (i.e. randomize the order)

(Note: The label $Y_1$ corresponds to a vector $X_{1j}$, and even after shuffling, $Y_1$ should still correspond to $X_{1j}$.)

(2) Select 1/3 of the data. (You are free to work with a larger set of the data, but it will take much longer time to train.)

(3) Split data into training and test samples using train_test_split (https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html). Set train_size = 0.8. (80% of $X$ is our training samples.) Print the dimension of training and test samples. </i></span>

In [87]:
for i in range(len(Y)):
  Y[i] = float(Y[i])
In [88]:
from sklearn.model_selection import train_test_split
from sklearn.utils import check_random_state

rand = np.arange(len(X))
rand = np.random.permutation(rand)
# Generate shuffled indices
X = X[rand][::3]
Y = Y[rand][::3]

x_train, x_test, y_train,y_test = train_test_split(X,Y, train_size=0.8)

Many machine learning algorithms are also not scale invariant, and hence we need to scale the data (different features to a uniform scale). All this comes under preprocessing the data. (http://scikit-learn.org/stable/modules/preprocessing.html#preprocessing) PCA is a prime example of when such normalization is important; if the variables are not measured on the same scale, then each principal component can be dominated by a single variable.

In this exercise, the MNIST pixel values in images should also be scaled prior to providing the images as an input to PCA. There are three main types of pixel scaling techniques: normalization (scaling pixel to the range 0-1), centering (scale pixel values to have a zero-mean), and standardization (scale pixel values to have a zero-mean and unit-variance).

First, let us try normalization. Each pixel contains a greyscale value quantified by an integer between 0 and 255. To standardize the dataset, we normalize the "$X$" data in the interval [0, 1].

2. Normalize the X data (both training and test).

In [46]:
x_train, x_test  = x_train/255, x_test/255

Next, using scikit-learn's PCA module, we can select the first two principal components from the original 784 dimensions.

In [47]:
from sklearn.decomposition import PCA

(1) Define the PCA model with the first 2 principal components:

  pca = PCA(n_components=2)

(2) Using "fit_transform," fit the model with the training X data and apply the dimensionality reduction on it.

  X_train_PCA = pca.fit_transform(training X data)

(3) With the same model, apply the dimensionality reduction on the test X data.

  X_test_PCA = pca.transform(test X data)

3. This problem is similar to HW2-Q4-Part3. For both training and test samples, create a scatterplot of the first and second principal component and color each of the different types of digits with a different color. Label each axis (e.g. x-axis: 1st principal component, y-axis: 2nd principal component). How does it compare to the UMAP results?

In [48]:
pca = PCA(n_components=2)
X_train_PCA = pca.fit_transform(x_train)
X_test_PCA = pca.transform(x_test)
In [49]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5), dpi = 200)
# Scatter plot for the training data with color bar
scatter_train = ax[0].scatter(X_train_PCA[:, 0], X_train_PCA[:, 1], c=y_train, cmap="tab10")
ax[0].set_title("Training Data")
plt.colorbar(scatter_train, ax=ax[0], pad = 0.01,)
canvas_ticks(ax[0])

# Scatter plot for the test data with color bar
scatter_test = ax[1].scatter(X_test_PCA[:, 0], X_test_PCA[:, 1], c=y_test, cmap="tab10")
ax[1].set_title("Test Data")
plt.colorbar(scatter_test, ax=ax[1],pad = 0.01,)
plt.tight_layout()
canvas_ticks(ax[1])
plt.show()

4. Select the first three principal components and make 3D scatterplot on the training data. (similar to HW2-Q4-Part5)

In [50]:
pca = PCA(n_components=3)
X_train_PCA = pca.fit_transform(x_train)
X_test_PCA = pca.transform(x_test)
In [51]:
from mpl_toolkits.mplot3d import Axes3D
In [64]:
a, b, c = X_train_PCA[:,0], X_train_PCA[:,1], X_train_PCA[:,2]
A, B, C = X_test_PCA[:,0], X_test_PCA[:,1], X_test_PCA[:,2]

# Create a 3D plot
fig = plt.figure(figsize = (8,5), dpi = 200)
ax = fig.add_subplot(111, projection='3d')
surf = ax.scatter(a, b, c, c=y_train, cmap="tab10")
ax.set_title("Training Data")

# Add a color bar
fig.colorbar(surf, ax=ax, pad = 0.1,shrink=0.8)
plt.tight_layout()
plt.show()


# Create a 3D plot
fig = plt.figure(figsize = (8,5), dpi = 200)
ax = fig.add_subplot(111, projection='3d')
surf = ax.scatter(A, B, C, c=y_test, cmap="tab10")
ax.set_title("Test Data")

# Add a color bar
fig.colorbar(surf, ax=ax, pad = 0.1,shrink=0.8)
plt.tight_layout()
plt.show()

From the graph we can see the two or three components definitely hold some information, especially for specific digits, but clearly not enough to set all of them apart. There are other techniques, such as UMAP module or t-SNE (t-Distributed Stochastic Neighbouring Entities), which can better reduce the dimensions for visualisation.


In [65]:
from sklearn.neighbors import KNeighborsClassifier as knn

Now, we will introduce K-nearest neighbors (KNN), one of the most widely used machine learning classification techniques. We use scikit-learn implementation of KNN: https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsRegressor.html

Ideally, we should tune KNN hyperparameters by doing a grid search using k-fold cross validation, but in this exercise we simply use default parameters with n_neighbors = 6.

(1) Define the knn classifier

  clf = knn(n_neighbors=6)

(2) Fit the model

  clf.fit(training X data, training Y/target data)

(3) Get the classification accuracy on the test data

  clf.score(test X data, test Y/target data)

5. Evaluate the classification accuracy on the test data using a KNN classifier.

In [67]:
clf = knn(n_neighbors=6)
clf.fit(x_train, y_train.astype(int))
clf_sore = clf.score(x_test, y_test.astype(int))
print(clf_sore)
0.9588600814227555

The above KNN classifier considers all 784 features for each image when making its decisions. What if you do not need that many? It is possible that a lot of those features do not really affect our predictions that much. Or worse, KNN could be considering feature anomalies that are unique to our training data, resulting in overfitting. One way to deal with this is by removing features that aren’t contributing much.

Now, suppose you take the first two principal components from PCA and fit your model using those two components.

  pca = PCA(n_components=2)

  X_train_PCA = pca.fit_transform(training X data)

  X_test_PCA = pca.transform(test X data)

Now you can take X_train_PCA, along with training Y data, to fit the KNN model and evaluate the classification accuracy.

6. Evaluate the classification accuracy with different number of PCA componenets. Let N_PCA_component = [2, 10, 20, 30, 40, 50, 60, 100, 200, 300, 400, 700]. Plot classification accuracy vs. number of PCA components. How does it compare to the accuracy in Part 5? Draw a horizontal line for the accuracy with all 784 features.

In [56]:
N_PCA_component = [2, 10, 20, 30, 40, 50, 60, 100, 200, 300, 400, 700]
score = []
for i in N_PCA_component:
  pca = PCA(n_components=i)
  X_train_PCA = pca.fit_transform(x_train)
  X_test_PCA = pca.transform(x_test)

  clf = knn(n_neighbors=6)
  clf.fit(X_train_PCA, y_train.astype(int))
  s = clf.score(X_test_PCA, y_test.astype(int))
  score.append(s)
In [68]:
score
Out[68]:
[0.42082708377973,
 0.918577244482537,
 0.9614313263338333,
 0.9697878722948361,
 0.9700021427040926,
 0.9682879794300407,
 0.9689307906578102,
 0.9674308977930148,
 0.9637883008356546,
 0.9610027855153204,
 0.959931433469038,
 0.9588600814227555]
In [69]:
plt.figure(figsize = (3.5,3.5), dpi = 200)
plt.plot(N_PCA_component, score, 'b.-', color = 'k')
plt.axhline(y = clf_sore)
plt.ylabel('Classification Accuracy')
plt.xlabel('N PCA Components')
canvas_ticks(plt)
# plt.legend(frameon = False)
plt.show()
<ipython-input-69-b5ed9f0cc651>:2: UserWarning: color is redundantly defined by the 'color' keyword argument and the fmt string "b.-" (-> color='b'). The keyword argument will take precedence.
  plt.plot(N_PCA_component, score, 'b.-', color = 'k')

7. Instead of the PCA method, fit the UMAP model with the training data and do unsupervised learning. Reduce data to 2 dimensions (embed to 2 dimensions) and train the KNN model on the embedded training data. Compared to Part 6, does it give you a higher classification accuracy even with n_component = 2?

In [70]:
!pip install umap-learn
import umap
Requirement already satisfied: umap-learn in /usr/local/lib/python3.10/dist-packages (0.5.4)
Requirement already satisfied: numpy>=1.17 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.23.5)
Requirement already satisfied: scipy>=1.3.1 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.11.2)
Requirement already satisfied: scikit-learn>=0.22 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (1.2.2)
Requirement already satisfied: numba>=0.51.2 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (0.56.4)
Requirement already satisfied: pynndescent>=0.5 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (0.5.10)
Requirement already satisfied: tqdm in /usr/local/lib/python3.10/dist-packages (from umap-learn) (4.66.1)
Requirement already satisfied: tbb>=2019.0 in /usr/local/lib/python3.10/dist-packages (from umap-learn) (2021.10.0)
Requirement already satisfied: llvmlite<0.40,>=0.39.0dev0 in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.2->umap-learn) (0.39.1)
Requirement already satisfied: setuptools in /usr/local/lib/python3.10/dist-packages (from numba>=0.51.2->umap-learn) (67.7.2)
Requirement already satisfied: joblib>=0.11 in /usr/local/lib/python3.10/dist-packages (from pynndescent>=0.5->umap-learn) (1.3.2)
Requirement already satisfied: threadpoolctl>=2.0.0 in /usr/local/lib/python3.10/dist-packages (from scikit-learn>=0.22->umap-learn) (3.2.0)
In [85]:
model = umap.UMAP(n_components=2)
embedding_train = model.fit_transform(x_train, y_train)
embedding_test = model.transform(x_test)

clf = knn(n_neighbors=6)
clf.fit(embedding_train, y_train.astype(int))
clf.score(embedding_test, y_test.astype(int))
Out[85]:
0.947075208913649

Instead of pixel normalization, we can also try feature rescaling through standardization (rescaling the features such that they have the properties of a standard normal distribution with a mean of zero and a standard deviation of one). We can use sklearn.preprocessing.StandardScaler for this job.

In [61]:
from sklearn.preprocessing import StandardScaler

(1) Define the StandardScaler

  sc = StandardScaler()

(2) Fit the training X data and then transform it.

  X_train = sc.fit_transform(training X data)

(3) Perform standardization on the test X data.

  X_test = sc.transform(test X data)

8. Re-load the MNIST data (Repeat Part 1) and try standardization on both training and test X data following the above steps. Evaluate the classification accuracy using a KNN classifier. How does it compare to Part 5?

In [91]:
sc = StandardScaler()
X_train = sc.fit_transform(x_train)
X_test = sc.transform(x_test)

clf = knn(n_neighbors=6)
clf.fit(X_train, y_train.astype(int))
clf.score(X_test, y_test.astype(int))
Out[91]:
0.9200771373473323

We did slightly worse now than in part 5, where our accuracy was 95.9% vs now we are getting 92.0% accuracy

9. Again, take the data from Part 8 (standardized X data) and do unsupervised learning using the UMAP module. Reduce the data to 2 dimensions, and plot the embedding as a scatterplot (for the training data) and color by the target array. How does it compare to HW2-Q5f? Which pixel rescaling method do you think works better?

In [98]:
model = umap.UMAP(n_components=2)
embedding_train = model.fit_transform(X_train)
embedding_test = model.transform(X_test)
In [99]:
fig, ax = plt.subplots(1, 2, figsize=(10, 5), dpi = 200)
# Scatter plot for the training data with color bar
scatter_train = ax[0].scatter(embedding_train[:, 0], embedding_train[:, 1],s = 1, c=y_train, cmap="tab10")
ax[0].set_title("Training Data")
plt.colorbar(scatter_train, ax=ax[0], pad = 0.01,)
canvas_ticks(ax[0])

# Scatter plot for the test data with color bar
scatter_test = ax[1].scatter(embedding_test[:, 0], embedding_test[:, 1], s= 1, c=y_test, cmap="tab10")
ax[1].set_title("Test Data")
plt.colorbar(scatter_test, ax=ax[1],pad = 0.01,)
plt.tight_layout()
canvas_ticks(ax[1])
plt.show()

it looks like this rescaling methode performed worse than last week on HW2. Last week there was clear separation between the different groups, but now many groups have outliers. Im not sure why this rescaling would make it worse though and just normalizaiton. Interesting.¶

In [ ]:
!jupyter nbconvert --to html "/content/drive/My Drive/P188_288/P188_288_HW/HW1_188.ipynb"